Take Home 3

Author

Guan Jhen LIN

Published

June 8, 2024

Modified

June 21, 2024

Credit: John Blank Photography

1. Introduction

FishEye International monitors business records of commercial fishing operators in Oceanus to identify and prevent illegal fishing. Analysts work with company data, including ownership, shareholders, transactions, and products/services, to build the CatchNet Knowledge Graph.

Last year, SouthSeafood Express Corp was caught illegally fishing, leading to its closure. FishEye wants to understand the temporal patterns and infer how the fishing market reacted to this event, as some businesses may have tried to capture SouthSeafood’s market share, while others may have become more cautious about illegal activities.

FishEye aims to develop visualization tools for CatchNet to identify influential people in business networks, considering the varied and changing shareholder and ownership relationships.

2. Mini-Challenge 3: Temporal Analysis

2.1 Tasks and Questions:

  • Develop an approach using visual analytics to highlight temporal patterns and changes in corporate structures, focusing on identifying the most active individuals and businesses.

  • Utilize visualizations to display typical and atypical business transactions, such as mergers and acquisitions, and infer the underlying motivations behind changes in their activity levels.

Data Source: VAST Challenge 2024: Mini-Challenge 3

3. Data

The MC3 dataset is a comprehensive collection comprising 60,520 nodes (entities) and 75,817 edges (relationships or connections) organized into 4,782 distinct components. The nodes in this dataset represent various types of entities, including individuals (Person), chief executive officers (CEO), companies, and other organizational structures.

On the other hand, the edges capture different types of relationships or interactions between these nodes. Some examples of edge types include shareholdership (ownership of shares in a company), beneficial ownership (enjoying the benefits of owning a property or asset without being the legal owner), and potentially other forms of associations or transactions.

The dataset is structured such that the nodes key contains a list representing all the node entities, with each node carrying attributes or properties that describe its characteristics, such as ID, type, country, revenue, founding date, and potentially other relevant details.

Correspondingly, the links key holds a list that represents all the edges or connections between these nodes. Each edge entry typically includes properties like the edge type, start and end dates (if applicable), and identifiers for the source and target nodes involved in the relationship.

With this comprehensive dataset capturing both node entities and their interconnections through various types of edges, researchers and analysts can conduct in-depth analyses to uncover patterns, understand the dynamics of relationships, and gain insights into the complex network of entities and their interactions.

3.1 Data Preparations

  • jsonlite: This package provides functionality for parsing and generating JSON data in R.
  • dplyr: A part of the tidyverse, dplyr provides a consistent set of verbs for data manipulation, making it easier to transform and summarize data frames.
  • tidyr: Another tidyverse package, tidyr helps in creating tidy data sets by providing functions for reshaping data frames.
  • stringr: This package provides a cohesive set of functions for string manipulation and regular expressions in R.
  • lubridate: lubridate is designed to make it easier to work with date-time data in R.
  • tidyverse: The tidyverse is a collection of R packages designed for data science, including dplyr, ggplot2, tidyr, and others.
  • readtext: readtext makes it easier to import and handle text data in R, particularly for text mining and analysis.
  • ggplot2: Part of the tidyverse, ggplot2 is a powerful data visualization package for creating complex and publication-ready plots.
  • visNetwork: visNetwork is a package for creating interactive network visualizations in R, using the vis.js library.
  • ggpubr: ggpubr provides easy-to-use functions for creating publication-ready plots and combining multiple ggplot2 plots into a single figure.
  • igraph: igraph is a collection of network analysis tools for creating and analyzing graphs and networks in R.
  • patchwork: patchwork is a package for composing multiple ggplot2 plots into a single figure, with easy layout control.

  • igraph: A package for creating and analysing graphs and networks.

  • ggraph: A package for creating graph-based data visualisations using the ‘ggplot2’ syntax.

  • ggrepel: A package for automatically adjusting text labels to avoid overlapping in ‘ggplot2’ visualisations.

Click to show code
pacman::p_load(jsonlite,dplyr,tidyr,stringr,lubridate,tidyverse,readtext,ggplot2,visNetwork,stringr,ggpubr, igraph, patchwork,igraph,ggraph,ggrepel)
Click to show code
mc3 <- fromJSON("data/mc3.json")

To effectively perform temporal analysis on such data, it can be beneficial to separate the data set into two distinct data frames: one for the node entities and another for the relationships or edges between them.

Click to show code
# Load necessary libraries
library(dplyr)
library(tidyr)
library(stringr)

# Assuming mc3 is already loaded as a list containing nodes dataframe
mc3_nodes <- as_tibble(mc3$nodes) %>%
  mutate(
    id = as.character(id),
    type = as.character(type),
    country = as.character(country),
    ProductServices = as.character(ProductServices),
    revenue_omu = as.numeric(revenue),
    head_of_org = as.character(HeadOfOrg),
    TradeDescription = as.character(TradeDescription),
    PointOfContact = as.character(PointOfContact),
    id = gsub("^[0-9]+\\.\\s*", "", id),  # Clean up IDs
    revenue_omu = ifelse(is.na(revenue_omu), 0, revenue_omu)
  ) %>%
  mutate(
    type = ifelse(type == "Entity.Person", "Entity.Person.Person", type)  # Rename Entity.Person
  ) %>%
  rename(
    last_edited_by = `_last_edited_by`,
    last_edited_date = `_last_edited_date`,
    date_added = `_date_added`,
    raw_source = `_raw_source`,
    algorithm = `_algorithm`
  ) %>%
  select(everything())

# Display the first few rows of the updated dataframe
print(head(mc3_nodes))
# A tibble: 6 × 17
  type    country ProductServices PointOfContact HeadOfOrg founding_date revenue
  <chr>   <chr>   <chr>           <chr>          <chr>     <chr>           <dbl>
1 Entity… Uziland Unknown         Rebecca Lewis  Émilie-S… 1954-04-24T0…   5995.
2 Entity… Mawala… Furniture and … Michael Lopez  Honoré L… 2009-06-12T0…  71767.
3 Entity… Uzifri… Food products   Steven Robert… Jules La… 2029-12-15T0…      0 
4 Entity… Islava… Unknown         Anthony Wyatt  Dr. Víct… 1972-02-16T0…      0 
5 Entity… Oceanus Unknown         Tracy Schmidt  Jacques … 1954-04-06T0…   4747.
6 Entity… Imazam  Fish, crustace… Corey Moore J… Thierry … 2031-09-30T0…  46567.
# ℹ 10 more variables: TradeDescription <chr>, last_edited_by <chr>,
#   last_edited_date <chr>, date_added <chr>, raw_source <chr>,
#   algorithm <chr>, id <chr>, dob <chr>, revenue_omu <dbl>, head_of_org <chr>
(mc3_nodes$type) %>% unique()
[1] "Entity.Organization.Company"         
[2] "Entity.Organization.LogisticsCompany"
[3] "Entity.Organization.FishingCompany"  
[4] "Entity.Organization.FinancialCompany"
[5] "Entity.Organization.NewsCompany"     
[6] "Entity.Organization.NGO"             
[7] "Entity.Person.Person"                
[8] "Entity.Person.CEO"                   
# Further processing to separate the 'type' column
mc3_nodes <- mc3_nodes %>%
  separate(type, into = c("type_1", "type_2", "type_3"), sep = "\\.", fill = "right", extra = "drop")

# Display the first few rows of the updated dataframe
print(head(mc3_nodes))
# A tibble: 6 × 19
  type_1 type_2       type_3  country   ProductServices PointOfContact HeadOfOrg
  <chr>  <chr>        <chr>   <chr>     <chr>           <chr>          <chr>    
1 Entity Organization Company Uziland   Unknown         Rebecca Lewis  Émilie-S…
2 Entity Organization Company Mawalara  Furniture and … Michael Lopez  Honoré L…
3 Entity Organization Company Uzifrica  Food products   Steven Robert… Jules La…
4 Entity Organization Company Islavara… Unknown         Anthony Wyatt  Dr. Víct…
5 Entity Organization Company Oceanus   Unknown         Tracy Schmidt  Jacques …
6 Entity Organization Company Imazam    Fish, crustace… Corey Moore J… Thierry …
# ℹ 12 more variables: founding_date <chr>, revenue <dbl>,
#   TradeDescription <chr>, last_edited_by <chr>, last_edited_date <chr>,
#   date_added <chr>, raw_source <chr>, algorithm <chr>, id <chr>, dob <chr>,
#   revenue_omu <dbl>, head_of_org <chr>
(mc3_nodes$type_3) %>% unique()
[1] "Company"          "LogisticsCompany" "FishingCompany"   "FinancialCompany"
[5] "NewsCompany"      "NGO"              "Person"           "CEO"             
Click to show code
mc3_edges <- as_tibble(mc3$links) %>%
  mutate(
    type = as.character(type),
    type_new = stringr::str_extract(type, "[^.]+$"),
    source = as.character(source),
    target = as.character(target),
    last_edited_by = as.character(`_last_edited_by`),
    last_edited_date = as.Date(`_last_edited_date`),
    date_added = as.Date(`_date_added`),
    raw_source = as.character(`_raw_source`),
    algorithm = as.character(`_algorithm`)
  ) %>%
  mutate(
    start_date = as.Date(start_date), 
    end_date = if_else(is.na(end_date), as.Date(NA), as.Date(end_date)),
    id = target
  ) %>%
  # Remove `_last_edited_by`, `_last_edited_date`, `_date_added`, `_raw_source`, `_algorithm` if they exist
  select(-c(`_last_edited_by`, `_last_edited_date`, `_date_added`, `_raw_source`, `_algorithm`)) %>%

  rename(
    last_edited_by = last_edited_by,
    last_edited_date = last_edited_date,
    date_added = date_added,
    raw_source = raw_source,
    algorithm = algorithm
  ) %>%

  select(everything())
Click to show code
  mc3_edges <- mc3_edges %>%
    mutate(
      start_date = as.POSIXct(start_date),
      last_edited_date = as.POSIXct(last_edited_date),
      date_added = as.POSIXct(date_added),
      end_date = if_else(is.na(end_date), as.POSIXct(NA), as.POSIXct(end_date)),
      id = target
    )

mc3_nodes <- mc3_nodes %>%
  mutate(
    last_edited_date = as.POSIXct(last_edited_date),
    date_added = as.POSIXct(date_added),
    founding_date = as.POSIXct(founding_date),
    dob = as.POSIXct(dob)
  )
Click to show code
mc3_edges [duplicated(mc3_edges ),]
# A tibble: 0 × 13
# ℹ 13 variables: start_date <dttm>, type <chr>, source <chr>, target <chr>,
#   key <int>, end_date <dttm>, type_new <chr>, last_edited_by <chr>,
#   last_edited_date <dttm>, date_added <dttm>, raw_source <chr>,
#   algorithm <chr>, id <chr>
mc3_nodes [duplicated(mc3_nodes ),]
# A tibble: 0 × 19
# ℹ 19 variables: type_1 <chr>, type_2 <chr>, type_3 <chr>, country <chr>,
#   ProductServices <chr>, PointOfContact <chr>, HeadOfOrg <chr>,
#   founding_date <dttm>, revenue <dbl>, TradeDescription <chr>,
#   last_edited_by <chr>, last_edited_date <dttm>, date_added <dttm>,
#   raw_source <chr>, algorithm <chr>, id <chr>, dob <dttm>, revenue_omu <dbl>,
#   head_of_org <chr>
Click to show code
nodes <- mc3_nodes %>%
  select(
    type_2, type_3, id, dob, country, 
    head_of_org, revenue, last_edited_by, 
    last_edited_date, date_added, raw_source, algorithm
  )

edges <- mc3_edges %>%
  select(
    type_new, id, start_date, end_date, 
    last_edited_by, last_edited_date, date_added, 
    raw_source, algorithm
  )

What are the Company Types?

# Count the unique values in type_2
type_3_unique_counts <- mc3_nodes %>%
  group_by(type_3) %>%
  summarise(count = n()) %>%
  arrange(desc(count))

# Display the summary table
print(type_3_unique_counts)
# A tibble: 8 × 2
  type_3           count
  <chr>            <int>
1 Person           50356
2 Company           7927
3 CEO               1293
4 FishingCompany     600
5 LogisticsCompany   311
6 FinancialCompany    23
7 NGO                  5
8 NewsCompany          5
duplicate_counts <- mc3_edges %>%
  group_by(target, type_new) %>%
  summarise(count = n(), .groups = 'drop') %>%
  filter(count > 1)

# Display the result
print(duplicate_counts)
# A tibble: 13,220 × 3
   target                      type_new            count
   <chr>                       <chr>               <int>
 1 6. America Transit Plc      Shareholdership         2
 2 Abbott, Mcbride and Edwards BeneficialOwnership     2
 3 Abbott, Mcbride and Edwards Shareholdership         6
 4 Abbott, Mcbride and Edwards WorksFor                3
 5 Abbott-Harrison             BeneficialOwnership     4
 6 Abbott-Harrison             Shareholdership         2
 7 Abbott-Ibarra               Shareholdership         4
 8 Abbott-Sullivan             Shareholdership         7
 9 Abbott-Sullivan             WorksFor                3
10 Acevedo PLC                 BeneficialOwnership     2
# ℹ 13,210 more rows
duplicate_counts <- mc3_edges %>%
  group_by(source, target, type_new) %>%
  summarise(count = n(), .groups = 'drop') %>%
  filter(count > 1)

# Display the result
print(duplicate_counts)
# A tibble: 2 × 4
  source          target                                type_new           count
  <chr>           <chr>                                 <chr>              <int>
1 Akira Hernandez PregolyaDredge Logistics Incorporated BeneficialOwnersh…     2
2 Isla Davis      Parks Ltd                             BeneficialOwnersh…     2

Extracting the top 10 id to see the different company types in edges and nodes file.

Click to show code
top_10_ids <- edges %>%
  group_by(id) %>%
  summarize(count = n()) %>%
  arrange(desc(count)) %>%
  slice(1:10)

Create a bar chart for nodes and edges.

Click to show code
barchart_type_counts_nodes <- mc3_nodes %>%
  count(type_3, sort = TRUE)

barchart_type_counts_edges <- mc3_edges %>%
  count(type_new, sort = TRUE)

b1 <- ggplot(barchart_type_counts_nodes, aes(x = reorder(type_3, -n), y = n)) +
  geom_bar(stat = "identity", fill = "lightblue") +
  labs(
    title = "Distribution of Nodes Types",
    x = "Nodes Type",
    y = "Count"
  ) +
  theme_minimal() +
  theme(
    axis.text.x = element_text(angle = 0, hjust = 0.8, size = 7)
  )

b2 <- ggplot(barchart_type_counts_edges, aes(x = reorder(type_new, -n), y = n)) +
  geom_bar(stat = "identity", fill = "lightblue") +
  labs(
    title = "Distribution of Edges Types",
    x = "Edge Type",
    y = "Count"
  ) +
  theme_minimal() +
  theme(
    axis.text.x = element_text(angle = 0, hjust = 0.5, size = 7)
  )
combined_plot <- b1 / b2
print(combined_plot)

💡Observations:
Nodes Type:
Person,’ has count exceeding 50,000 and is the most common entities in the network. ‘Companies’ numbering around 5,000
CEO’ are present but much fewer, reflecting a small proportion of individuals with this specific role.
Other organization types, such as ‘FishingCompany,’ ‘LogisticsCompany,’ ‘FinancialCompany,’ ‘NewsCompany,’ and ‘NGO,’ are present in minimal numbers.
Edges Type:
Shareholdership,’ has over 35,000 instances, indicating ownership stakes in companies as a primary form of interaction.
BeneficialOwnership’ is the second most common relationship, with around 20,000 instances. ‘Worksfor’ is slightly lesser than beneficialownership with around 15,000 instances.
FamilyRelationship’ connections are the least represented, suggesting that familial ties are not as prominent as business and ownership relations in the dataset.

Companies and Relationship to Types

Network Graph

Click to show code
library(dplyr)
library(igraph)
library(visNetwork)

# Assuming mc3_edges is already loaded as a dataframe

# Get the top 20 companies based on the count of edges
top_companies <- mc3_edges %>%
  count(id, sort = TRUE) %>%
  top_n(20, wt = n)

# Filter edges to include only those involving the top 20 companies
filtered_edges <- mc3_edges %>%
  filter(id %in% top_companies$id) %>%
  select(id, type_new)

# Prepare edges for graph creation
edges_for_graph <- filtered_edges %>%
  rename(from = id, to = type_new)

# Create a bipartite graph
bipartite_graph <- graph_from_data_frame(d = edges_for_graph, directed = FALSE)

# Assign types for bipartite mapping
V(bipartite_graph)$type <- bipartite_mapping(bipartite_graph)$type

# Prepare nodes for visualization
nodes_vis <- data.frame(
  id = V(bipartite_graph)$name, 
  label = V(bipartite_graph)$name, 
  group = ifelse(V(bipartite_graph)$type, "Type", "Company")
)

# Prepare edges for visualization
edges_vis <- igraph::as_data_frame(bipartite_graph, what = "edges")

# Create and customize the visNetwork graph
vis1 <- visNetwork(nodes_vis, edges_vis, width = "100%", height = "800px") %>%
        visNodes(shape = "dot", scaling = list(label = list(enabled = TRUE))) %>%
        visEdges(arrows = "none", color = list(color = "lightgray", highlight = "red")) %>%
        visOptions(highlightNearest = TRUE, nodesIdSelection = TRUE) %>%
        visLegend() %>%
        visGroups(groupname = "Type", shape = "dot", color = "blue") %>%
        visGroups(groupname = "Company", shape = "dot", color = "green") %>%
        visLayout(randomSeed = 21) %>%
        visInteraction(navigationButtons = TRUE) %>%
        visPhysics(stabilization = FALSE, enabled = FALSE)
# Display the graph
vis1
💡Observations:
The graph helps in understanding the complex web of interactions in the dataset, highlighting how entities are interconnected through various types of relationships. Each node represents an entity, such as a company or individual, and the edges depict the relationships between them, including shareholdership, beneficial ownership, and employment links. For example, Mosley and Sons has connection to shareholdership, worksfor and beneficial ownership while Nguyen, Shelton and Hayes only has connection to shareholdership and beneficial ownership.

Overlapping Business Types?

Click to show code
top_10_ids <- mc3_edges %>%
  group_by(id) %>%
  summarize(count = n()) %>%
  arrange(desc(count)) %>%
  slice(1:10)

filtered_edges <- mc3_edges %>%
  filter(id %in% top_10_ids$id)

type_new_counts <- filtered_edges %>%
  group_by(id, type_new) %>%
  summarise(count = n(), .groups = "drop") %>%
  arrange(desc(count))

type_new_colors <- c("Shareholdership" = "pink", 
                     "BeneficialOwnership" = "lightblue", 
                     "WorksFor" = "lightgreen")

wrap_text <- function(text, width) {
  wrapped_text <- str_replace_all(text, "-", " ")
  wrapped_text <- str_wrap(wrapped_text, width = width)
  wrapped_text <- str_replace_all(wrapped_text, " ", "-")
  return(wrapped_text)
}

type_new_counts <- type_new_counts %>%
  mutate(id_wrapped = wrap_text(id, 10)) 

type_new_counts <- type_new_counts %>%
  mutate(id_wrapped = factor(id_wrapped, levels = type_new_counts %>%
                               group_by(id_wrapped) %>%
                               summarize(total_count = sum(count)) %>%
                               arrange(desc(total_count)) %>%
                               pull(id_wrapped)))

b4 <- ggplot(type_new_counts, aes(x = id_wrapped, y = count, fill = type_new)) +
  geom_bar(stat = "identity", position = position_dodge(width = 0.9), width = 0.7) +
  geom_text(aes(label = count), position = position_dodge(width = 0.9), vjust = 0.8, size = 3) + # Add text labels
  labs(x = " ", y = " ", title = "Top Names with Multiple Type Relationships") +
  theme_minimal() +
  theme(
    axis.text.x = element_text(angle = 0, hjust = 0.5, size = 8),  
    legend.position = "bottom"
  ) +
  scale_fill_manual(values = type_new_colors, name = "Type")
print(b4)

💡Observations:
This bar chart visualises the top names with multiple types of relationships in the network. The x-axis lists the top entities, while the y-axis represents the count of different relationship types, illustrating how some organisations or individuals are central to multiple types of interactions within the network.
Downs Group and Mills, Atkinson and Chavez have the highest number of relationships, indicating their significant roles in various types of network interactions.

Which Company has the highest Revenue?

Click to show code
library(plotly)
company_revenue <- mc3_nodes %>%
  group_by(id, type_3) %>%
  summarise(total_revenue = sum(revenue_omu, na.rm = TRUE), .groups = 'drop') 

plot <- plot_ly(data = company_revenue, 
                x = ~id,  
                y = ~total_revenue, 
                text = ~paste("Name:", id, 
                              "<br>Total Revenue:", total_revenue,
                              "<br>Type:", type_3), 
                type = "scatter",  
                mode = "markers",  
                marker = list(size = ~total_revenue / 1000000,  
                              sizemode = 'area', 
                              sizeref = 0.1, 
                              color = ~total_revenue, 
                              colorscale = "Viridis", 
                              showscale = TRUE)) 

plot <- plot %>%
  layout(title = "Interactive Bubble Chart of Company Revenue",
         xaxis = list(title = "", showticklabels = FALSE),  # Hide x-axis labels
         yaxis = list(title = "Total Revenue"),
         hovermode = "closest")
plot
💡Observations:
This bubble chart visualises the distribution of total revenue for different entities in the network. Each bubble represents an entity, with the size of the bubble corresponding to the total revenue and the colour indicating the magnitude of the revenue. The largest yellow bubble represents Briggs-Wilson, a company with a total revenue of approximately 310.6 million.

Irregular Patterns by Revenue

Extracting year from full date as using the full date format is too detail for analysis.

edges <- edges %>%
  mutate(
    Start_Year = year(start_date),
    End_Year = year(end_date)
  )

nodes <- nodes %>%
  mutate(added_year = year(date_added))

The id with top 8 revenues had been extracted and visualise using heat map.

Click to show code
edges <- edges %>%
  mutate(
    Start_Year = year(start_date),
    End_Year = year(end_date)
  )

nodes <- nodes %>%
  mutate(added_year = year(date_added))

top_8_revenue_nodes <- nodes %>%
  arrange(desc(revenue)) %>%
  slice_head(n = 8)

MC2_node_abnor <- edges %>%
  filter(id %in% top_8_revenue_nodes$id) %>%
  group_by(id, Start_Year) %>%
  summarise(weight = n(), .groups = "drop")

MC2_node_abno <- edges %>%
  filter(id %in% top_8_revenue_nodes$id) %>%
  group_by(id, Start_Year) %>%
  summarise(weight = n(), .groups = "drop")

g1 <- ggplot(MC2_node_abnor, aes(Start_Year, id)) +
  geom_tile(aes(fill = weight)) +
  geom_text(aes(label = weight), size = 3) +
  labs(title = "Irregular Pattern by Revenue") +
  scale_fill_gradient(low = "white", high = "lightblue") +
  theme(axis.text.x = element_text(angle = 0, hjust = 1, size = 8)) +
  theme(axis.text.y = element_text(size = 6))
print(g1)

💡Observations:
This heat map visualises the irregular patterns in revenue for the top 8 revenue companies over time. Each cell represents a specific entity and year, with the numbers indicating the weight or frequency of revenue changes. More details of transaction and relationship examination should be done to check any atypical business activities.

- Williams, Stanley and Butlet had a unusually high activity count of 9 in 2025.
- Yang PLC had peaks in 2012, 2021 and 2022.
- Short, Hernandez and Myers had irregular activity peaks at 2011 and 2020.
- Roth Ltd had isolated peaks in 2010 and 2025.
- Pearson-Williams had high activity during 2015 to 2017 with a drop off after.
- Dunlap, Fleming and Brown had significant peak in 2025 but low activity overall.
- Cooper, Jacobs and Gonzalez had high activities from 2018 to 2021.
- Briggs-Wilson had peaks in 2010 and 2028 but low activity overall.

Irregular Activities of Business

Click to show code
filtered_edges <- edges %>%
  filter(!is.na(start_date) & !is.na(end_date))

filtered_edges <- filtered_edges %>%
  mutate(year = year(start_date))

activity_counts <- filtered_edges %>%
  group_by(year, id, type_new) %>%
  summarise(activity_count = n()) %>%
  ungroup()

activity_counts <- activity_counts %>%
  group_by(type_new) %>%
  mutate(
    Q1 = quantile(activity_count, 0.25),
    Q3 = quantile(activity_count, 0.75),
    IQR = Q3 - Q1,
    lower_bound = Q1 - 1.5 * IQR,
    upper_bound = Q3 + 1.5 * IQR,
    is_outlier = activity_count < lower_bound | activity_count > upper_bound
  ) %>%
  ungroup()

o1 <- ggplot(activity_counts, aes(x = year, y = activity_count, color = is_outlier)) +
      geom_point() +
      geom_text_repel(aes(label = ifelse(is_outlier, id, "")), size = 2.5) +
      facet_wrap(~ type_new) +
      labs(title = "Activity Counts per Year with Outliers Highlighted",
           x = "Year", y = "Activity Count", color = "Outlier") +
      theme_minimal()
print(o1)

Click to show code
filtered_edges <- edges %>%
  filter(!is.na(start_date) & !is.na(end_date))

filtered_edges <- filtered_edges %>%
  mutate(year = year(start_date))

activity_counts <- filtered_edges %>%
  group_by(year, id, type_new) %>%
  summarise(activity_count = n()) %>%
  ungroup()

activity_counts <- activity_counts %>%
  group_by(type_new) %>%
  mutate(
    Q1 = quantile(activity_count, 0.25),
    Q3 = quantile(activity_count, 0.75),
    IQR = Q3 - Q1,
    lower_bound = Q1 - 1.5 * IQR,
    upper_bound = Q3 + 1.5 * IQR,
    is_outlier = activity_count < lower_bound | activity_count > upper_bound
  ) %>%
  ungroup()

ggplot(activity_counts, aes(x = year, y = activity_count, color = is_outlier)) +
  geom_point() +
  geom_text_repel(aes(label = ifelse(is_outlier, id, "")), size = 2.5) +
  facet_wrap(~ type_new, scales = "free_y") +  # Using free_y scale for better visualization
  labs(title = "Activity Counts per Year with Outliers Highlighted",
       x = "Year", y = "Activity Count", color = "Outlier") +
  scale_color_manual(values = c("FALSE" = "blue", "TRUE" = "red")) +  # Color coding outliers
  theme_minimal() +
  theme(
    axis.text.x = element_text(angle = 45, hjust = 1),
    plot.title = element_text(size = 14, face = "bold")
  )

💡Observations:
The scatter plot displays activity counts per year, categorised by BeneficialOwnership, Shareholdership, and WorksFor relationships. The y-axis represents the activity count, while the x-axis shows the year. Outliers (TRUE), highlighted in blue, indicate entities with unusually high activity levels and indicates unusual activities.
Stephens-Lopez stands out with multiple outliers around 2019 onwards and indicating a substantial increase in activities.
Wright LLC showed increase activities since 2018 onwards and these activities were flagged as outliers indicating a potential unusual ownership transfer or increase in ownership stakes for.
Smith-Ramirez showed a sudden rise around 2020
Haas-Tran and Blair PLC display outliers around 2030 indicating potential abnormal business activities.
outliers_summary <- activity_counts %>%
  filter(is_outlier) %>%
  select(year, id, type_new, activity_count) %>%
  arrange(type_new, year, desc(activity_count))
print(outliers_summary)
# A tibble: 31 × 4
    year id              type_new            activity_count
   <dbl> <chr>           <chr>                        <int>
 1  2014 Wright LLC      BeneficialOwnership              2
 2  2015 Wright LLC      BeneficialOwnership              2
 3  2017 Nichols-Esparza BeneficialOwnership              2
 4  2022 Wright LLC      BeneficialOwnership              2
 5  2023 Nichols-Esparza BeneficialOwnership              2
 6  2027 Smith-Ramirez   BeneficialOwnership              4
 7  2029 Smith-Ramirez   BeneficialOwnership              2
 8  2030 Smith-Ramirez   BeneficialOwnership              2
 9  2031 Smith-Ramirez   BeneficialOwnership              5
10  2032 Stephens-Lopez  BeneficialOwnership              7
# ℹ 21 more rows
ggplot(outliers_summary, aes(x = year, y = id, fill = activity_count)) +
  geom_tile(color = "white") +
  scale_fill_gradient(low = "lightblue", high = "red") +
  facet_wrap(~ type_new, scales = "free_y") +
  labs(title = "Heatmap of Outliers in Activity Counts",
       x = "Year", y = "Entity", fill = "Activity Count") +
  theme_minimal() +
  theme(
    axis.text.x = element_text(angle = 45, hjust = 1),
    plot.title = element_text(size = 14, face = "bold")
  )

💡Observations:
This heatmap illustrates outliers in activity counts for BeneficialOwnership and Shareholdership relationships. Entities like Wright LLC and Stephens-Lopez exhibit significant outliers in BeneficialOwnership, while Woods Inc and Weaver-Price show unusual activity in Shareholdership. The activity counts have increased since late 2020, indicating abnormal patterns.

Conclusions

  • The visual analytics approach effectively highlights temporal patterns and changes within companies. However, continuous monitoring and in-depth investigations are necessary to uncover the underlying motivations behind these changes.

  • Transforming complex data into meaningful visualisations is challenging. Each graph felt like peeling back layers to reveal the intricate web of corporate activities, much like solving a puzzle. This process requires a blend of technical skills and creative thinking, but it ultimately provides valuable insights to better understand the network.